ABSTRACT

Ensembl Genomes (http://www.ensemblgenomes.org)
is an integrating resource for genome-scale
data from non-vertebrate species. The project 
exploits and extends technologies for genome an- 
notation, analysis and dissemination, developed in 
the context of the vertebrate-focused Ensembl
project, and provides a complementary set of 
resources for non-vertebrate species through a 
consistent set of programmatic and interactive 
interfaces. These provide access to data including 
reference sequence, gene models, transcriptional 
data, polymorphisms and comparative analysis. 
This article provides an update to the previous pub- 
lications about the resource, with a focus on recent 
developments. These include the addition of import- 
ant new genomes (and related data sets) including 
crop plants, vectors of human disease and eukary- 
otic pathogens. In addition, the resource has scaled 
up its representation of bacterial genomes, and 
now includes the genomes of over 9000 bacteria. 
Specific extensions to the web and programmatic 
interfaces have been developed to support users 
in navigating these large data sets. Looking
forward, analytic tools to allow targeted selection
of data for visualization and download are likely to
become increasingly important as the
number of available genomes increases within all
domains of life, and some of the challenges faced
in representing bacterial data are likely to become
commonplace for eukaryotes in future.

OVERVIEW AND ACCESS 

Ensembl Genomes (http://www.ensemblgenomes.org) is
organized as five sites, each focused on one of the traditional
kingdoms of life: bacteria (specific URL http://bacteria.ensembl.org),
protists, fungi, plants and (invertebrate) metazoa.
Vertebrate metazoa are the focus of the
Ensembl project (http://www.ensembl.org) (1); Ensembl
Genomes provides a complementary set of interfaces for 
non-vertebrate species. Core data available for all species 
include genome sequence and annotations of protein- 
coding and non-coding genes; additional data include 
transcriptional data, polymorphisms and comparative 
analysis. Interactive access is provided through a web 
interface providing genome browsing capabilities: users 
can scroll through a graphical representation of a DNA 
molecule at various levels of resolution, seeing the relative 
locations of features — including conceptual annotations 
[e.g. genes, single nucleotide polymorphism (SNP) loci], 
sequence patterns (e.g. repeats) and experimental 
data (e.g. sequences and external sequence features 
mapped onto the genome) — supporting the primary 
annotations. Functional information is provided through
direct curation, import from the UniProt Knowledgebase 
(2) or imputation from protein sequence [using the classi- 
fication tool InterProScan (3)]. Users can download much 
of the data available on each page in a variety of formats, 
and tools exist for upload of (various types of) user data, 
allowing users to see their own annotation in the context 
of the reference sequence. DNA- and protein-based
sequence searches are also available.

The data are stored in a set of MySQL databases using 
the same schemas as those in use for the Ensembl project. 
Direct access to these is provided through a public 
MySQL server (mysql.ebi.ac.uk:4157; user 'anonymous') 
and additionally through well-developed Application 
Programming Interfaces (APIs) that provide an object- 
oriented framework for working with the data. Database 
dumps and common data sets (e.g. DNA, RNA and 
protein sequence sets and sequence alignments) can be 
directly downloaded in bulk via file transfer protocol 
(ftp://ftp.ensemblgenomes.org). 
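
As a minimal sketch of direct database access, the following Python snippet assembles a command line for the standard mysql client from the host, port and user quoted above. The helper name mysql_command is hypothetical and no database name is assumed; the code only builds the argument list and does not open a connection.

```python
import shlex

# Connection details as quoted in the text above.
HOST = "mysql.ebi.ac.uk"
PORT = 4157
USER = "anonymous"


def mysql_command(statement="SHOW DATABASES"):
    """Return the argv list for the standard mysql client (hypothetical helper)."""
    return [
        "mysql",
        f"--host={HOST}",
        f"--port={PORT}",
        f"--user={USER}",
        f"--execute={statement}",
    ]


if __name__ == "__main__":
    # Print a shell-safe command line; run it manually to list the databases.
    print(" ".join(shlex.quote(arg) for arg in mysql_command()))
```

Listing the databases first is a reasonable starting point, because the names of the individual species databases are not given in this article.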

Ensembl Genomes data are also made available through 
a series of data warehouses, optimized around common 
(gene and SNP-centric) queries, using the BioMart data 
warehousing system (4). The BioMart framework provides 
a series of interfaces, including web-based query building 
tools, for each of the Ensembl Genomes (eukaryotic) 
domains (e.g. at http://plants.ensembl.org/biomart/martview)
and a variety of other interfaces for interactive
and programmatic access. BioMarts are not currently 
available for Ensembl Bacteria. 
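
For programmatic use, BioMart queries are conventionally expressed as a small XML document naming a dataset, optional filters and the attributes to return. The sketch below builds such a document in Python. The function name build_biomart_query, the schema value "default" and every dataset, filter and attribute name supplied by the caller are assumptions for illustration; none of them is taken from this article, and the real names should be looked up in the web-based query builder.

```python
import xml.etree.ElementTree as ET


def build_biomart_query(dataset, attributes, filters=None, schema="default"):
    """Serialize a BioMart-style query as an XML string.

    All dataset, attribute and filter names are caller-supplied
    placeholders; none are taken from the article.
    """
    query = ET.Element(
        "Query",
        virtualSchemaName=schema,
        formatter="TSV",
        header="1",
        uniqueRows="1",
    )
    ds = ET.SubElement(query, "Dataset", name=dataset, interface="default")
    # Filters restrict the rows returned; attributes select the columns.
    for name, value in (filters or {}).items():
        ET.SubElement(ds, "Filter", name=name, value=value)
    for name in attributes:
        ET.SubElement(ds, "Attribute", name=name)
    return ET.tostring(query, encoding="unicode")


if __name__ == "__main__":
    print(build_biomart_query("placeholder_dataset", ["placeholder_attribute"]))
```

Building the document with an XML library rather than string concatenation ensures that filter values are escaped correctly.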

Ensembl Genomes is released 4-5 times a year, in syn- 
chrony with releases of Ensembl, using the same software 
as the corresponding Ensembl release. The overall suite of 
Ensembl Genomes interfaces mirrors the interfaces 
provided for vertebrate genomes in Ensembl, and allows 
users access to genomic data from across the tree of life in 
a consistent manner. 



A COLLABORATIVE MODEL FOR 
GENOME-SCALE DATA 

The Ensembl Genomes project is driven by a number of 
domain-specific collaborations, each with a scientific com- 
munity with its own focus of interest. By working in part- 
nership with us, communities can benefit from a robust 
infrastructure and the integration of their data within a 
comprehensive service. These collaborations take a 
number of forms. In some domains, we work with our 
partners to develop a community-centric service, aimed at 
each community's specific needs, but also mirror key data 
within the central Ensembl Genomes portal. Examples of 
such collaborations include VectorBase (http://www.vectorbase.org)
(5), a resource for the genomes of invertebrate
vectors of human disease; WormBase (http://www.wormbase.org)
(6), which maintains resources for
nematode genomes, especially the model species
Caenorhabditis elegans; PomBase (http://www.pombase.org)
(7), the model organism database for the fission
yeast Schizosaccharomyces pombe; and PhytoPath
(http://www.phytopathdb.org), a resource for plant pathogens,
with a focus on fungi and oomycetes. In other domains,
we collaborate more broadly with other integrative centers, 
with a goal of developing high-quality networks of inter- 
linked resources through the sharing of common reference 
data and standards for interoperability. In the context of 
Ensembl Plants, for example, we work closely with the 
Gramene database (http://www.gramene.org) (8) and a 
number of leading European plant genomics and informatics
centers through the transPLANT project (http://www.transplantdb.eu).
In addition, we contribute to many community-driven
projects to sequence, assemble and annotate
particular genomes, and make the resulting data available 
through the Ensembl Genomes site. 

Ensembl Genomes prioritizes data for incorporation, 
according to scientific importance. The criteria for 
priority treatment are first, data relevant to our specific 
collaborations; second, data from other major experimen- 
tal species; and third, data from other species that provide 
local or remote evolutionary context for the priority 
species, and which are used to strengthen the comparative 
analysis provided in the site. For the first category of 
genomes, we actively work with our collaborators to 
produce the primary community-recognized annotation. 
For the second category, we supplement the reference an- 
notation (often maintained by model organism databases 
or other similar resources) with additional high-value data 
sets. For several species in these two categories, we have 
constructed variation databases, which store genotypes, 
loci and phenotypes from large-scale genome-wide array- 
based and resequencing studies, and have made the data 
available through specialized graphical views and an SNP- 
centric BioMart. Variation data are sourced from dbSNP 
(9) or Database of Genomic Variants archive (10) where 
available, or otherwise directly from the data producers. 
For the third category of genomes, annotation is generally 
incorporated from the original submitters with only 
limited enhancement (for example, the annotation of 
non-coding genes, if absent in the original submission). 

At the time of writing, there have been 10 releases of
Ensembl Genomes since the previous report was published
in this journal (11). The current release is release 20, made
public in September 2013. In this time, there has been a 
significant increase in the content of all five Ensembl 
Genomes sites. 

Metazoa 

Nineteen new genomes have been added, including the 
sponge Amphimedon queenslandica, the South and Central
American malarial mosquito Anopheles darlingi, the leaf-cutter
ant Atta cephalotes, the silkworm Bombyx mori, the
water flea Daphnia pulex, the Pacific oyster Crassostrea
gigas, the owl limpet Lottia gigantea, the scuttle fly
Megaselia scalaris, the centipede Strigamia maritima, the
kissing bug Rhodnius prolixus, the red flour beetle
Tribolium castaneum, the two-spotted spider mite
Tetranychus urticae, two annelid worms, two butterflies
and three nematodes. Additional variation data (12,13) 
have been introduced for Anopheles gambiae, and new 
DNA-based comparative analysis has been added for 
nematodes. 
Plants

Twenty-two new genomes have been added, including 
Chinese cabbage (Brassica rapa), soybean (Glycine
max), barley (Hordeum vulgare), banana (Musa
acuminata), barrel clover (Medicago truncatula), the club
moss Selaginella moellendorffii, foxtail millet (Setaria
italica), tomato (Solanum lycopersicum), potato (Solanum
tuberosum), two species of rice, two diploid ancestors of 
hexaploid bread wheat and, as taxonomic outliers, two 
algal species. In addition, the preliminary genome 
assemblies, homeologous SNP calls and expressed
sequence tag (EST) sequences available for bread
wheat have been aligned to the genomes of barley and
Brachypodium distachyon, and a sequence search has
been implemented against the EST sequences that
visualizes the results in the context of their alignments
to these references. Variation databases have been
provided for barley (14), maize (15), rice (16,17) and 
sorghum (18). The variation data set for Arabidopsis 
thaliana has been expanded to include additional data 
from the 1001 Genomes Project (19) and other work, 
including phenotypic data (20). Additional comparative 
alignments have been produced for cereal genomes. 

Fungi 

Twenty-four new genomes have been added, including 
20 plant pathogens (Blumeria graminis, Botrytis cinerea,
Fusarium oxysporum, Gaeumannomyces graminis,
Glomerella graminicola, Leptosphaeria maculans,
Melampsora larici-populina, Microbotryum violaceum,
Nectria haematococca, Puccinia triticina, Sclerotinia
sclerotiorum, Sporisorium reilianum, Trichoderma reesei,
Ustilago maydis, two species of Gibberella, two species of 
Magnaporthe and two species of Pyrenophora). Other 
species added include the human pathogen Cryptococcus 
neoformans, the truffle Tuber melanosporum and two additional
yeast species. RNA-seq alignments (to the genome)
have been added for P. triticina; EST alignments have 
been added for Phaeosphaeria nodorum, S. pombe, T. 
melanosporum and Zymoseptoria tritici; and new compara- 
tive genomic alignments have been added for certain 
Pyrenophora and yeast species. For phytopathogenic 
fungal (and protist) species, information about genes im- 
pacting on pathogenesis has been imported from the PHI- 
base database (21), and mutant and overexpression 
phenotypes are now represented in a color-coded form 
in the genome browser. 

Protists 

Eleven new genomes, including those of several important 
plant and human pathogens, have been added: Albugo
laibachii, Entamoeba histolytica, Giardia lamblia,
Guillardia theta, Hyaloperonospora arabidopsidis, Leishmania
major, Paramecium tetraurelia, Pythium ultimum,
Tetrahymena thermophila, Toxoplasma gondii and
Trypanosoma brucei. New DNA alignments have been
provided for the ciliates, the Peronosporales and the
Trypanosomatidae. A variation database has been added 
for Phytophthora infestans. 



Bacteria 

Ensembl Bacteria has been comprehensively expanded 
since release 17. Although previously the bacterial 
division of Ensembl had focused on a small number of
selected clades, the division now contains all bacterial 
genomes that have been completely sequenced, annotated 
and submitted to the International Nucleotide Sequence 
databases (European Nucleotide Archive, GenBank and 
the DNA Database of Japan) (22), a total of 9089 
genomes in the latest release. Additional information is
incorporated from UniProtKB and InterPro, together with
information about operons from RegulonDB (23) and about reaction
catalysts from Microme (http://www.micromedb.eu). To
ensure that data within this expanded set remain discov- 
erable, two new species selection mechanisms have been 
introduced into the portal, one using autocomplete and 
the other providing a taxonomically structured interface 
(illustrated in Figure 1). The latter also enables the restric- 
tion of (sequence and text) search to user-defined taxo- 
nomic segments. Additionally, the Ensembl Perl API has 
been extended with a new lookup module, allowing users 
to discover genomes matching their specifications (e.g. full 
or partial name-match, taxonomic identifier, nucleotide 
sequence accession) programmatically. Within the 
browser, an improved representation of transcripts and 
translations, capable of providing a correct representation 
of bacterial features (i.e. polycistronic transcripts and 
alternative translational initiation) has been introduced. 
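
The kinds of lookup described above (full or partial name match, taxonomic identifier, sequence accession) can be illustrated with a toy in-memory index. This Python sketch is not the Ensembl Perl API: the class names Genome and GenomeIndex and their methods are hypothetical, and the two records are included only as illustrative examples.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Genome:
    name: str
    taxonomy_id: int
    accession: str


class GenomeIndex:
    """Toy in-memory index; a hypothetical stand-in, not the Ensembl API."""

    def __init__(self, genomes):
        self._genomes = list(genomes)
        self._by_taxon = {g.taxonomy_id: g for g in self._genomes}
        self._by_accession = {g.accession: g for g in self._genomes}

    def by_name(self, pattern):
        """Case-insensitive full or partial name match."""
        needle = pattern.lower()
        return [g for g in self._genomes if needle in g.name.lower()]

    def by_taxonomy_id(self, taxonomy_id):
        """Exact match on the taxonomic identifier, or None."""
        return self._by_taxon.get(taxonomy_id)

    def by_accession(self, accession):
        """Exact match on the nucleotide sequence accession, or None."""
        return self._by_accession.get(accession)


# Illustrative records only.
INDEX = GenomeIndex([
    Genome("Escherichia coli str. K-12 substr. MG1655", 511145, "U00096"),
    Genome("Bacillus subtilis subsp. subtilis str. 168", 224308, "AL009126"),
])
```

With thousands of genomes, exact-match keys such as accessions suit dictionary lookups, whereas partial name matching requires a scan or a dedicated text index.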

IMPROVED TOOLS FOR DATA ACCESS 

A number of improvements to the Ensembl infrastructure 
have been made during the past year, including the intro- 
duction of a scrollable browser and a new RESTful API (a 
language-agnostic supplement to the existing Perl API), 
and the range of data formats provided (for appropriate
data types) via file transfer protocol has been expanded
to include General Feature Format (GFF) and Variant Call Format (VCF).
A new fast sequence search, based on a back-end provided 
by the European Nucleotide Archive, has been introduced 
for all species alongside a Basic Local Alignment Search
Tool (BLAST) server. A feature allowing portions of gene
trees to be highlighted based on the existence of common
annotation has been introduced. Support has been
introduced for annotations comprising structured
assemblies of ontology terms (e.g. for complex phenotypic
description), and a new browser has been implemented for
ontological terms, which depicts the ancestry of annotated
terms and provides links through to BioMart to allow
users to retrieve gene sets annotated with any term in
the display. Finally, automatic display of remote files is
now supported for any data file using any known synonym
to identify the reference sequence on which the data are to
be visualized.



COMPARATIVE ANALYSIS 

Extensive comparative analyses are performed between
the sequences in Ensembl Genomes. Analyses include
pairwise alignments between DNA sequences, using the
tools LASTZ (24) and (for more diverged genomes)
translated BLAT (25) combined with the use of the
chain/net algorithm of Kent et al. (26). The number of
these comparisons has increased and we now have 118
pairwise alignments. In Ensembl Plants, pairwise alignments
are provided for rice against every other genome,
A. thaliana against every other genome (except barley) and
14 other pairwise comparisons. In Ensembl Metazoa,
comparisons are provided from Drosophila melanogaster
to 11 other drosophilid species and 4 mosquitoes, for all
pairwise combinations of A. gambiae, A. darlingi, Aedes
aegypti and Culex quinquefasciatus, from C. elegans to 8
other nematodes and from Brugia malayi to Loa loa. In
Ensembl Fungi, all-against-all alignments are available in
the Aspergillus, Hypocreales, Pucciniales and Pyrenophora
clades and for Saccharomyces cerevisiae against Ashbya
gossypii. In Ensembl Protists, DNA alignments are
provided for each of three Phytophthora species against
each other. No DNA-based comparisons are currently
provided for bacterial species.

Figure 1. Species selection in Ensembl Bacteria. The figure shows the selection of a basket of genomes for use in a BLAST search. A tree-based
navigation system allows the selection of defined portions of the taxonomy for use as library sequences. An autocomplete feature assists the location
of particular genomes within the tree.

Protein alignments are used to reconstruct evolutionary
trees for related genes using the Ensembl Compara Gene
Trees pipeline (27). These are run for each eukaryotic
domain and additionally for a representative selection of
species from across the taxonomic space to identify widely
conserved families and deep homologies between different
evolutionary branches. In the current release, the
pan-taxonomic database was constructed from the
genomes of 12 chordates (11 vertebrates, plus Ciona
intestinalis), 15 non-chordate metazoans, 7 plants, 7
fungi, 8 protists, 98 bacteria and 25 archaea. Genomes
are chosen for inclusion according to a variety of
criteria, including mutual taxonomic distance, number of
recorded publications, inclusion in previous editions
of the pan-taxonomic Compara and overlap with the reference
proteome sets defined by UniProtKB. In total,
79 005 gene trees have been constructed for a total of
1 070 325 proteins. Their distribution among the different
taxonomic domains is shown in Figure 2. Bacterial
proteins (from all included genomes) have additionally
been grouped into families using the HAMAP (28) and
Panther (29) resources.
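
The figures quoted above can be combined into two summary numbers: the total number of genomes in the pan-taxonomic database and the mean number of proteins per gene tree. The short Python calculation below uses only the counts given in the text; the variable names are arbitrary.

```python
# Genome counts for the pan-taxonomic database, as listed in the text.
GENOMES = {
    "chordates": 12,
    "non-chordate metazoans": 15,
    "plants": 7,
    "fungi": 7,
    "protists": 8,
    "bacteria": 98,
    "archaea": 25,
}
GENE_TREES = 79_005
PROTEINS = 1_070_325

# Sum the per-group counts and average the proteins over the trees.
total_genomes = sum(GENOMES.values())
mean_proteins_per_tree = PROTEINS / GENE_TREES

print(f"{total_genomes} genomes in the pan-taxonomic database")
print(f"{mean_proteins_per_tree:.1f} proteins per gene tree on average")
```

This gives 172 genomes and a mean of roughly 13.5 proteins per tree. The mean says nothing about the shape of the distribution, which is likely to be skewed by a few very large families.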

Figure 2. Taxonomic distribution of gene families in the pan-taxonomic comparative analysis in release 19 of Ensembl Genomes. Large
numbers of families [defined by clustering according to the Ensembl Gene Trees algorithm (27)] are found only in one domain of life.
However, families can be found spanning all combinations of domains. The most overrepresented spans (compared with expectations
based on the same proportion of families covering each domain, but assuming the co-coverage of two domains is random) are (i) all five
domains and (ii) all four non-bacterial domains; the most underrepresented spans are (i) bacteria and metazoa and (ii) bacteria,
metazoa and fungi. For each family of related proteins, a gene tree is constructed and made available for visualization and download,
estimating the evolutionary history of that family.

CEREALS: SERIOUSLY BIG GENOMES

The genomes of several economically important crop
species have not yet been completely sequenced owing to
their large size and highly repetitive DNA. However,
during the last year, early versions of the diploid barley
genome (5 Gb) and the hexaploid bread wheat genome
(16 Gb) have become available. Neither genome is yet
available in a completely assembled form. The current
barley genome assembly (14) consists of around 1.9 Gb
of DNA in 612 267 contigs of over 200 bp, of which
~400 Mb have been located at chromosome
level using markers from extant physical and genetic
maps. A total of 24 211 high-confidence protein-coding
genes have been called, of which 64% are in anchored
locations. The N50 is only 1405 bp, but the N50 of gene-containing
scaffolds is much higher (8.4 kb). Despite the
fragmented nature of the genome, barley is represented
conventionally in Ensembl Plants, with data shown at all
levels from the karyotype through to comparative analysis
and variation. In the absence of high-level scaffolding,
approximate colocation of contigs to marker sequences
located on the physical map is used to provide an approximation
of the order and orientation of contigs at each
chromosomal locus. Additionally, unanchored contigs
have been grouped together in a synthetic 'chromosome'
(consisting of the actual contigs with arbitrary gaps
between them) to better fit the data model (and critically,
to improve analysis times), and contigs of <200 nucleotides
have been excluded from the database. In all other
respects, the genome can be accessed in the same way as
any of the better-assembled genomes in the resource.
A typical view of the barley genome in Ensembl Plants
is depicted in Figure 3.
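
Two of the ideas in this paragraph, the N50 statistic and the synthetic 'chromosome' built from unanchored contigs, can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the procedure used to build the database: the function names are hypothetical, the gap size of 100 is arbitrary, and only the 200-nucleotide cutoff comes from the text.

```python
def n50(lengths):
    """Return the N50: the largest length L such that sequences of length
    >= L together cover at least half of the total assembly length."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    raise ValueError("no sequences supplied")


def layout_synthetic_chromosome(contigs, min_length=200, gap=100):
    """Place unanchored contigs end to end with fixed gaps.

    `contigs` is a list of (name, length) pairs. Contigs shorter than
    `min_length` are dropped, mirroring the 200-nucleotide cutoff in the
    text; the gap size is an arbitrary assumption.
    Returns (name, start, end) tuples in 1-based inclusive coordinates.
    """
    placed = []
    position = 1
    for name, length in contigs:
        if length < min_length:
            continue
        placed.append((name, position, position + length - 1))
        position += length + gap
    return placed
```

For example, n50([2, 2, 2, 3, 3, 4, 8, 8]) is 8, because the two longest sequences already account for half of the total length of 32. Laying out contigs of 500, 150 and 300 nucleotides places the first at 1-500, skips the second and places the third at 601-900.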

The wheat genome assemblies published in late 2012
(30) are even more fragmentary, with many contigs, even
those in genic regions, less extensive than the genes themselves;
thus, creation of accurate complete gene models
is difficult. Therefore, in Ensembl Plants, we have
presented the data in the form of alignments of the
genomic contigs onto two better-assembled reference
genomes, Brachypodium and barley. This enables the
wheat sequences, and the homeologous variants
(between the three wheat genomes) that have been
identified, to be located in the context of full-length gene
models predicted in these closely related species.
Additionally, a set of 1.3 million wheat ESTs has been
mapped onto the Brachypodium and barley genomes,
and a sequence search facility is provided against the
wheat EST set that returns an alignment of the query
sequence against the top-matching ESTs and additionally
shows where those ESTs align to the genomes of these two
related species.



PERSPECTIVE AND PRIORITIES 

Over the next 2 years, we anticipate that increasingly 
complete versions of the hexaploid bread wheat genome
will emerge from the efforts of the International
Wheat Genome Sequencing Consortium (http://www.wheatgenome.org/),
allowing for a transition from the
current limited representation of the genome to a more 
complete representation. As such, wheat will be the first 
polyploid genome to be fully represented in the Ensembl 
system. Although the size and repetitive nature of the 
genome is a challenge in terms of assembly and annota- 
tion, the Ensembl database schema and interface are 
flexible and should accommodate data from polyploid 
species with only minor modification. Each of the three 
genomes will be separately analyzed in the Ensembl com- 
parative analysis pipelines, placing each genome separ- 
ately within each gene tree and identifying homeologous 
regions of DNA sequence that will be displayed in an 
integrated stacked visualization. The existing representation
of homeologous variants will be extended to show
their functional consequence on gene models. 

Figure 3. The barley genome represented in Ensembl Plants. The figure shows resequencing alignments from a number of cultivars against the
reference cultivar Morex genome assembly and annotation for a sequenced contig given approximate chromosomal location through integration with
the genetic map.

A more significant challenge lies in data discovery, as
the number of available genomes and data sets continues
to rise: how can users discover whether information that
might be of interest to them exists in the system? We anticipate
an increased need for data analysis tools not just
as an end in themselves but also as a route of data access.
Users will not necessarily start their analyses knowing
which genomes or genes they wish to work on; instead,
they might wish to ask questions of the complete data set
to identify genomes that differ in terms of their gene
content or genes whose presence/absence/copy number
difference differentiates two genomes (for example, to
determine the difference between pathogenic and non-pathogenic
strains of related organisms or between competent
and non-competent vectors). Likewise, variant features
will also be studied in the context of their presence or
absence in certain individuals/populations. Supporting
these use cases will require rapid gene classification
(including common occurrences of novel gene families
identified in multiple species) and a high-performance
data warehouse to support analysis of the data and help
users identify the features of interest within the total data
set. Another increasingly important use case is likely to be
the dynamic display of data on demand from archived
analyses (e.g. sequence alignments, variant calls) selected
on the basis of associated experimental metadata.
Developing such tools will be a priority for Ensembl
Genomes over the coming years.
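
The presence/absence/copy number comparison mentioned above can be sketched with ordinary set operations. In this Python illustration, each genome is summarized as a mapping from gene family identifier to copy number. The function name, the family identifiers and the counts are all hypothetical and serve only to show the shape of the question.

```python
def compare_gene_content(families_a, families_b):
    """Compare two genomes by gene family copy number.

    Each argument maps a family identifier to its copy number in one genome.
    Returns families unique to each genome and those with differing counts.
    """
    present_a = {f for f, n in families_a.items() if n > 0}
    present_b = {f for f, n in families_b.items() if n > 0}
    shared = present_a & present_b
    return {
        "only_in_a": sorted(present_a - present_b),
        "only_in_b": sorted(present_b - present_a),
        "copy_number_differs": sorted(
            f for f in shared if families_a[f] != families_b[f]
        ),
    }


# Hypothetical family identifiers and counts, for illustration only.
pathogenic = {"fam_toxin": 2, "fam_adhesin": 1, "fam_core": 3}
non_pathogenic = {"fam_adhesin": 4, "fam_core": 3, "fam_other": 1}
result = compare_gene_content(pathogenic, non_pathogenic)
```

Here the comparison reports fam_toxin as present only in the first genome, fam_other only in the second, and fam_adhesin as differing in copy number. Answering the same question across thousands of genomes is what motivates the high-performance data warehouse described in the text.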